Merged
Contributor
Pull request overview
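This PR improves charset detection in three main ways:
Changes:
- Strengthens the charset detection algorithm by using a text-repetition test to pick the best encoding
- Adds support for the UTF-32LE and UTF-32BE encodings (not supported by the native TextDecoder)
- Improves the unit tests by using more realistic script code as test data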
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
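| src/pkg/utils/encoding.ts | Adds the decodeUTF32 and bytesDecode functions; refactors the detectEncoding logic |
| src/pkg/utils/encoding.test.ts | Greatly expands the test cases, using real-world data |
| src/pages/install/App.tsx | Uses the new bytesDecode function in place of TextDecoder |
| package.json | Adds the iconv-lite dependency for tests |
| pnpm-lock.yaml | Dependency lockfile update |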
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
    if (!decodedText) continue;
    if (!highestConfidence) {
      highestConfidence = entry.confidence;
      if (highestConfidence > 90) return encoding;
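The detected encoding should be returned only after the loop has finished, not by terminating the loop early. The current logic returns as soon as the first decode succeeds, but judging by the code that follows, it should keep analyzing every candidate encoding and then choose the best one. Consider changing `return encoding;` to `continue;`, or removing this line.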
Suggested change
    - if (highestConfidence > 90) return encoding;
    + if (highestConfidence > 90) continue;
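To make the reviewer's point concrete, here is a minimal sketch of a loop that scores every candidate before choosing, rather than returning on the first high-confidence hit. Everything in it is an assumption for illustration: the `Candidate` shape, the `pickEncodingSketch` name, and especially the scoring rule, which simply counts U+FFFD replacement characters. It is not the text-repetition test this PR implements.

```typescript
// Hypothetical candidate shape, assumed to arrive sorted by confidence (highest first).
interface Candidate {
  name: string;
  confidence: number;
}

// Stand-in score: number of U+FFFD replacement characters (lower is better).
function countReplacementChars(text: string): number {
  let n = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0xfffd) n++;
  }
  return n;
}

// Decode with every candidate, then keep the one with the lowest score.
function pickEncodingSketch(
  bytes: Uint8Array,
  candidates: Candidate[],
  fallback = "utf-8"
): string {
  let best: string | null = null;
  let bestScore = Infinity;
  for (const candidate of candidates) {
    let text: string;
    try {
      text = new TextDecoder(candidate.name).decode(bytes);
    } catch {
      continue; // label not supported by this runtime's TextDecoder
    }
    if (!text) continue;
    const score = countReplacementChars(text);
    // Strict "<" keeps the earlier (higher-confidence) candidate on ties.
    if (score < bestScore) {
      bestScore = score;
      best = candidate.name;
    }
  }
  return best ?? fallback;
}
```

With this shape, a high confidence value only breaks ties; a candidate that decodes more cleanly can still win over one ranked above it.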
Member
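I don't quite follow your homegrown theory. Is it driven by the confidence score? This scenario shouldn't run into any truly outlandish encoding problems anyway, but handling it this thoroughly is fine too.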
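Determining the charset is not easy.
Among the JS libraries that work in both Node and the browser, chardet seems to be the only reasonably good one.
On top of it I introduce my own theory, which determines the charset accurately and effectively.
I added tests. With the original chardet `detect`, the correct charset cannot be determined accurately.
I won't explain the principle here. Ask Copilot.
(It is my own theory; an AI could not have come up with it, but it should be able to understand it.)
The native TextDecoder does not support utf-32le or utf-32be.
These are converted manually (LE is converted directly via DataView; BE needs an extra conversion step)
and integrated into bytesDecode.
The real detection target is script code.
If you test with only a few bytes, you will certainly not find anything.
Give it at least a full sentence.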
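As an illustration of manual UTF-32 decoding, here is a minimal sketch that reads 4-byte code units through a `DataView`. The name `decodeUTF32Sketch`, its signature, the BOM skipping, and the U+FFFD substitution for invalid values are all assumptions made for this example; the PR's actual `decodeUTF32` may work differently, including in how it treats LE versus BE.

```typescript
// Hypothetical helper, not the PR's decodeUTF32.
function decodeUTF32Sketch(bytes: Uint8Array, littleEndian: boolean): string {
  // View over the same memory, honoring any offset into a larger buffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Ignore a trailing partial code unit (fewer than 4 bytes).
  const usable = bytes.byteLength - (bytes.byteLength % 4);
  let offset = 0;
  // Skip a leading byte order mark if one is present.
  if (usable >= 4 && view.getUint32(0, littleEndian) === 0xfeff) offset = 4;
  let out = "";
  for (; offset < usable; offset += 4) {
    const codePoint = view.getUint32(offset, littleEndian);
    // Values above U+10FFFF and surrogates are not valid Unicode scalar values.
    const valid =
      codePoint <= 0x10ffff && !(codePoint >= 0xd800 && codePoint <= 0xdfff);
    out += String.fromCodePoint(valid ? codePoint : 0xfffd);
  }
  return out;
}
```

In this sketch the byte order is handled entirely by the `littleEndian` flag of `getUint32`, so one loop serves both variants.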